Abstract: We study the problem of gambling in horse races
with causal side information and show that Massey's directed
information characterizes the increment in the maximum achievable
capital growth rate due to the availability of side information.
This result gives a natural interpretation of directed
information $I(Y^n \to X^n)$ as the amount of information that $Y^n$
causally provides about $X^n$. Extensions to stock market portfolio
strategies and data compression with causal side information are
also discussed.

I. Introduction 

Mutual information arises as the canonical answer to a variety
of problems. Most notably, Shannon [1] showed that the
capacity $C$, the maximum data rate for reliable communication
over a discrete memoryless channel $p(y|x)$ with input $X$ and
output $Y$, is given by



$$ C = \max_{p(x)} I(X;Y), \qquad (1) $$



which leads naturally to the operational interpretation of mutual
information $I(X;Y) = H(X) - H(X|Y)$ as the amount
of uncertainty about $X$ that can be reduced by observation
$Y$, or equivalently, the amount of information $Y$ can provide
about $X$. Indeed, mutual information $I(X;Y)$ plays the central
role in Shannon's random coding argument, because the
probability that independently drawn $X^n$ and $Y^n$ sequences
"look" as if they were drawn jointly decays exponentially
with exponent $I(X;Y)$. Shannon also proved a dual result
[2] showing that the minimum compression rate $R$ to satisfy
a certain fidelity criterion $D$ between the source $X$ and its
reconstruction $\hat{X}$ is given by
$R(D) = \min_{p(\hat{x}|x)} I(X;\hat{X})$. In
another duality result (Lagrange duality this time) to (1),
Gallager [3] proved the minimax redundancy theorem, connecting
the redundancy of the universal lossless source code to the
capacity of the channel with conditional distribution described
by the set of possible source distributions.

Later on, it was shown that mutual information also plays an
important role in problems that are not necessarily related to
describing sources or transferring information through channels.
Perhaps the most lucrative example is the use of mutual
information in gambling.

Kelly showed in [4] that if each horse race outcome can
be represented as an independent and identically distributed
(i.i.d.) copy of a random variable $X$ and the gambler has some
side information $Y$ relevant to the outcome of the race, then
under some conditions on the odds, the mutual information
$I(X;Y)$ captures the difference between growth rates of the
optimal gambler's wealth with and without side information
$Y$. Thus, Kelly's result gives an interpretation of mutual
information $I(X;Y)$ as the value of side information $Y$ for
the horse race $X$.
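
Kelly's statement can be checked numerically on a toy race. The sketch below is only an illustration under stated assumptions: the joint pmf, the fair odds of 2, and the function names are hypothetical choices, and the gambler is assumed to bet all wealth proportionally, b(x) = p(x) without the tip and b(x|y) = p(x|y) with it.

```python
import math

def optimal_growth_rates(joint, odds):
    """Return (W*(X), W*(X|Y)) in bits per race for a single i.i.d. race.

    joint[x][y] = P(X = x, Y = y); odds[x] = payoff per unit bet on horse x.
    Assumes proportional (Kelly) betting of the entire wealth.
    """
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    w_without = sum(p * math.log2(p * o) for p, o in zip(px, odds) if p > 0)
    w_with = sum(
        pxy * math.log2(pxy / py[y] * odds[x])
        for x, row in enumerate(joint)
        for y, pxy in enumerate(row)
        if pxy > 0
    )
    return w_without, w_with

def mutual_information(joint):
    """I(X;Y) in bits for a joint pmf given as a nested list."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[x] * py[y]))
        for x, row in enumerate(joint)
        for y, pxy in enumerate(row)
        if pxy > 0
    )

# Hypothetical two-horse race; the tip Y names the winner 80% of the time.
joint = [[0.4, 0.1], [0.1, 0.4]]
odds = [2.0, 2.0]
w_without, w_with = optimal_growth_rates(joint, odds)
gain = w_with - w_without  # should match mutual_information(joint)
```

In this example the growth rate without the tip is zero (fair odds, uniform winner), so the entire growth rate with the tip comes from the side information.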

In order to tackle problems arising in information systems 
with causally dependent components, Massey [5] introduced 
the notion of directed information as 

$$ I(X^n \to Y^n) \triangleq \sum_{i=1}^{n} I(X^i; Y_i \mid Y^{i-1}), $$

and showed that the maximum directed information upper
bounds the capacity of channels with feedback. Subsequently,
it was shown that Massey's directed information and its
variants indeed characterize the capacity of feedback and
two-way channels [6]-[13] and the rate distortion function with
feedforward [14].

The main contribution of this paper is showing that directed
information $I(Y^n \to X^n)$ has a natural interpretation in
gambling as the difference in growth rates due to causal side
information. As a special case, if the horse race outcome and
the corresponding side information sequences are i.i.d., then
the (normalized) directed information becomes a single-letter
mutual information $I(X;Y)$, and it coincides with Kelly's
result.

The paper is organized as follows. We describe the notation
of directed information and causal conditioning in Section II.
In Section III, we formulate the horse-race gambling problem,
in which side information is revealed causally to the gambler.
We present the main result in Section IV and an analytically
solved example in Section V. Finally, Section VI concludes the
paper and states two possible extensions of this work to the stock
market and to data compression with causal side information.

II. Directed information and causal conditioning 

Throughout this paper, we use the causal conditioning notation
$(\cdot \| \cdot)$ developed by Kramer [6]. We denote by $p(x^n \| y^{n-d})$
the probability mass function (pmf) of $X^n = (X_1, \ldots, X_n)$
causally conditioned on $Y^{n-d}$, for some integer $d \ge 0$, which
is defined as

$$ p(x^n \| y^{n-d}) \triangleq \prod_{i=1}^{n} p(x_i \mid x^{i-1}, y^{i-d}). $$

(By convention, if $i - d \le 0$, then $y^{i-d}$ is set to null.) In
particular, we use extensively the cases $d = 0, 1$:

$$ p(x^n \| y^n) \triangleq \prod_{i=1}^{n} p(x_i \mid x^{i-1}, y^{i}), $$

$$ p(x^n \| y^{n-1}) \triangleq \prod_{i=1}^{n} p(x_i \mid x^{i-1}, y^{i-1}). $$

Using the chain rule, we can easily verify that 

$$ p(x^n, y^n) = p(x^n \| y^n)\, p(y^n \| x^{n-1}). $$

The causally conditional entropy $H(X^n \| Y^n)$ is defined as

$$ H(X^n \| Y^n) \triangleq -E[\log p(X^n \| Y^n)] = \sum_{i=1}^{n} H(X_i \mid X^{i-1}, Y^i). $$
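
The chain rule for causal conditioning can be verified by brute force on a small joint pmf. The following is a minimal sketch, not part of the original development: the joint pmf is an arbitrary hypothetical one on binary sequences of length 2, stored as a dictionary keyed by `(x_tuple, y_tuple)`, and all function names are illustrative.

```python
import itertools

def marginal(joint, xs, ys):
    """P(X^i = xs, Y^j = ys): sum over all sequence pairs with these prefixes."""
    return sum(p for (x, y), p in joint.items()
               if x[:len(xs)] == xs and y[:len(ys)] == ys)

def causal_pmf_x(joint, x, y):
    """p(x^n || y^n) = prod_i p(x_i | x^{i-1}, y^i)."""
    prob = 1.0
    for i in range(1, len(x) + 1):
        den = marginal(joint, x[:i - 1], y[:i])
        if den == 0:
            return 0.0
        prob *= marginal(joint, x[:i], y[:i]) / den
    return prob

def causal_pmf_y_delayed(joint, x, y):
    """p(y^n || x^{n-1}) = prod_i p(y_i | y^{i-1}, x^{i-1})."""
    prob = 1.0
    for i in range(1, len(y) + 1):
        den = marginal(joint, x[:i - 1], y[:i - 1])
        if den == 0:
            return 0.0
        prob *= marginal(joint, x[:i - 1], y[:i]) / den
    return prob

# Arbitrary (hypothetical) joint pmf on binary sequence pairs of length n = 2.
n = 2
pairs = [(x, y)
         for x in itertools.product((0, 1), repeat=n)
         for y in itertools.product((0, 1), repeat=n)]
weights = list(range(1, len(pairs) + 1))
joint = {pair: w / sum(weights) for pair, w in zip(pairs, weights)}

# Largest deviation from p(x^n, y^n) = p(x^n || y^n) p(y^n || x^{n-1}).
max_gap = max(
    abs(causal_pmf_x(joint, x, y) * causal_pmf_y_delayed(joint, x, y) - joint[(x, y)])
    for x, y in pairs
)
```

The product telescopes term by term, which is why `max_gap` is zero up to floating-point error.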

Under this notation, directed information can be written as 

$$ I(Y^n \to X^n) = \sum_{i=1}^{n} I(X_i; Y^i \mid X^{i-1}) = H(X^n) - H(X^n \| Y^n), $$

which suggests, in rough analogy to mutual information, an
interpretation of directed information $I(Y^n \to X^n)$ as
the amount of information that the causally available side
information $Y^n$ can provide about $X^n$.
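
The two expressions for directed information can be compared numerically. The sketch below assumes binary alphabets and an arbitrary hypothetical joint pmf on sequences of length 2; it evaluates the sum of conditional mutual informations and the entropy difference separately. Function names are illustrative only.

```python
import itertools
import math

def marginal(joint, xs, ys):
    """P(X^i = xs, Y^j = ys): sum over all sequence pairs with these prefixes."""
    return sum(p for (x, y), p in joint.items()
               if x[:len(xs)] == xs and y[:len(ys)] == ys)

def directed_information(joint, n):
    """I(Y^n -> X^n) = sum_i I(X_i; Y^i | X^{i-1}) in bits (binary alphabets)."""
    total = 0.0
    for i in range(1, n + 1):
        for xs in itertools.product((0, 1), repeat=i):
            for ys in itertools.product((0, 1), repeat=i):
                pxy = marginal(joint, xs, ys)
                if pxy == 0:
                    continue
                num = pxy * marginal(joint, xs[:-1], ())
                den = marginal(joint, xs[:-1], ys) * marginal(joint, xs, ())
                total += pxy * math.log2(num / den)
    return total

def entropy_difference(joint, n):
    """H(X^n) - H(X^n || Y^n) = E[log2(p(X^n || Y^n) / p(X^n))]."""
    total = 0.0
    for (x, y), pxy in joint.items():
        if pxy == 0:
            continue
        causal = 1.0
        for i in range(1, n + 1):
            causal *= marginal(joint, x[:i], y[:i]) / marginal(joint, x[:i - 1], y[:i])
        total += pxy * math.log2(causal / marginal(joint, x, ()))
    return total

# Arbitrary (hypothetical) joint pmf on binary sequence pairs of length n = 2.
n = 2
pairs = [(x, y)
         for x in itertools.product((0, 1), repeat=n)
         for y in itertools.product((0, 1), repeat=n)]
weights = list(range(1, len(pairs) + 1))
joint = {pair: w / sum(weights) for pair, w in zip(pairs, weights)}

di_sum = directed_information(joint, n)
di_entropy = entropy_difference(joint, n)
```

Both quantities agree up to rounding, and the value is nonnegative because it is a sum of conditional mutual informations.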

Note that the channel capacity results involve the term
$I(X^n \to Y^n)$, which measures the information in the forward
link $X^n \to Y^n$. In contrast, in gambling the gain in growth
rate is due to the side information (backward link), and
therefore the expression $I(Y^n \to X^n)$ appears.

III. Gambling in horse races with causal side information

Suppose that there are $m$ racing horses in an infinite
sequence of horse races and let $X_i \in \mathcal{X} = \{1, 2, \ldots, m\}$,
$i = 1, 2, \ldots$, denote the horse that wins at time $i$. Before
betting in the $i$-th horse race, the gambler knows some side
information $Y_i \in \mathcal{Y}$. We assume that the gambler invests all his
capital in the horse race as a function of the information that he
knows at time $i$, i.e., the previous horse race outcomes $X^{i-1}$
and the side information $Y^i$ up to time $i$. Let $b(x_i | x^{i-1}, y^i)$ be the
proportion of wealth that the gambler bets on horse $x_i$ given
$X^{i-1} = x^{i-1}$ and $Y^i = y^i$. The betting scheme should satisfy
$b(x_i | x^{i-1}, y^i) \ge 0$ (no shorting) and
$\sum_{x_i} b(x_i | x^{i-1}, y^i) = 1$ for
any history $x^{i-1}, y^i$. Let $o(x_i | x^{i-1})$ denote the odds of horse
$x_i$ given the previous outcomes $x^{i-1}$, which is the amount of
capital that the gambler gets for each unit of capital invested
in the horse. We denote by $S(x^n \| y^n)$ the gambler's wealth
after $n$ races where the race outcomes were $x^n$ and the side
information that was causally available was $y^n$. The growth,
denoted by $W(X^n \| Y^n)$, is defined as the expected logarithm
(base 2) of the gambler's wealth, i.e.,

$$ W(X^n \| Y^n) \triangleq E[\log S(X^n \| Y^n)]. \qquad (2) $$

Finally, the growth rate $\frac{1}{n} W(X^n \| Y^n)$ is defined as the
normalized growth.

Here is a summary of the notation: 

• $X_i$ is the outcome of the horse race at time $i$.

• $Y_i$ is the side information at time $i$.

• $o(X_i | X^{i-1})$ is the payoff at time $i$ for horse $X_i$ given
that the horses $X^{i-1}$ won the previous races.

• $b(X_i | Y^i, X^{i-1})$ is the fraction of the gambler's wealth
invested in horse $X_i$ at time $i$ given that the outcomes
of the previous races are $X^{i-1}$ and the side information
available at time $i$ is $Y^i$.

• $S(X^n \| Y^n)$ is the gambler's wealth after $n$ races when the
outcomes of the races are $X^n$ and the side information
$Y^n$ is causally available.

• $\frac{1}{n} W(X^n \| Y^n)$ is the growth rate.

Without loss of generality, we assume that the gambler's
capital is 1 initially; therefore $S_0 = 1$.

IV. Main Results 

In Subsection IV-A we assume that the gambler invests all
his money in the horse race, while in Subsection IV-B we
allow the gambler to invest only part of the money. Using
Kelly's result, it is shown in Subsection IV-B that if the odds
are fair with respect to some distribution, then the gambler
should invest all his money in the race.

A. Investing all the money in the horse race 

We assume that at any time $n$ the gambler invests all his
capital and therefore

$$ S(X^n \| Y^n) = b(X_n | X^{n-1}, Y^n)\, o(X_n | X^{n-1})\, S(X^{n-1} \| Y^{n-1}). $$

This also implies that

$$ S(X^n \| Y^n) = \prod_{i=1}^{n} b(X_i | X^{i-1}, Y^i)\, o(X_i | X^{i-1}). $$

The following theorem characterizes the optimal betting
strategy and the corresponding growth of wealth.

Theorem 1: For any finite horizon $n$, the maximum growth
rate is achieved when the gambler invests the money in proportion
to the causal conditioning distribution, i.e.,

$$ b^*(x_i | x^{i-1}, y^i) = p(x_i | x^{i-1}, y^i), \quad \forall x^i, y^i,\ i \le n, \qquad (3) $$

and the growth is

$$ W^*(X^n \| Y^n) = E[\log o(X^n)] - H(X^n \| Y^n). $$

Note that the sequence $\{p(x_i | x^{i-1}, y^i)\}_{i=1}^{n}$ uniquely
determines $p(x^n \| y^n)$. Also, for all pairs $(x^n, y^n)$ such that
$p(x^n \| y^n) > 0$, the sequence $\{p(x_i | x^{i-1}, y^i)\}_{i=1}^{n}$ is
determined uniquely by $p(x^n \| y^n)$ simply by the identity

$$ p(x_i | x^{i-1}, y^i) = \frac{p(x^i \| y^i)}{p(x^{i-1} \| y^{i-1})}. $$

A similar argument applies for $\{b^*(x_i | x^{i-1}, y^i)\}_{i=1}^{n}$ and
$b^*(x^n \| y^n)$, and therefore (3) is equivalent to

$$ b^*(x^n \| y^n) = p(x^n \| y^n), \quad \forall x^n \in \mathcal{X}^n,\ y^n \in \mathcal{Y}^n. $$



Proof of Theorem 1: We have

$$
\begin{aligned}
W^*(X^n \| Y^n) &= \max_{b(x^n \| y^n)} E[\log b(X^n \| Y^n)\, o(X^n)] \\
&= \max_{b(x^n \| y^n)} E[\log b(X^n \| Y^n)] + E[\log o(X^n)] \\
&= -H(X^n \| Y^n) + E[\log o(X^n)],
\end{aligned}
$$

where the last equality is achieved by choosing $b(x^n \| y^n) =
p(x^n \| y^n)$, and it is justified by the following upper bound:

$$
\begin{aligned}
E[\log b(X^n \| Y^n)]
&= \sum_{x^n, y^n} p(x^n, y^n) \left[ \log p(x^n \| y^n) + \log \frac{b(x^n \| y^n)}{p(x^n \| y^n)} \right] \\
&= -H(X^n \| Y^n) + \sum_{x^n, y^n} p(x^n, y^n) \log \frac{b(x^n \| y^n)}{p(x^n \| y^n)} \\
&\overset{(a)}{\le} -H(X^n \| Y^n) + \log \sum_{x^n, y^n} p(x^n, y^n) \frac{b(x^n \| y^n)}{p(x^n \| y^n)} \\
&\overset{(b)}{\le} -H(X^n \| Y^n) + \log \sum_{x^n, y^n} p(y^n \| x^{n-1})\, b(x^n \| y^n) \\
&= -H(X^n \| Y^n), \qquad (4)
\end{aligned}
$$

where (a) follows from Jensen's inequality and (b) from the
fact that $\sum_{x^n, y^n} p(y^n \| x^{n-1})\, b(x^n \| y^n) = 1$. All summations
in (4) are over the arguments $(x^n, y^n)$ for which $p(x^n, y^n) > 0$.
This ensures that $p(x^n \| y^n) > 0$, and therefore, we can
multiply and divide by $p(x^n \| y^n)$ in the first step of (4). ∎

In the case that the odds are fair and uniform, i.e.,
$o(X_i | X^{i-1}) = |\mathcal{X}|$, then

$$ \frac{1}{n} W^*(X^n \| Y^n) = \log |\mathcal{X}| - \frac{1}{n} H(X^n \| Y^n). $$

Thus the sum of the growth rate $\frac{1}{n} W^*(X^n \| Y^n)$ and the
entropy rate $\frac{1}{n} H(X^n \| Y^n)$ of the horse race process conditioned
causally on the side information is constant, and one can see
a duality between $H(X^n \| Y^n)$ and $W^*(X^n \| Y^n)$; cf. [15,
th. 6.1.3].

Let us denote by $\Delta W$ the increase in the growth rate due
to causal side information, i.e.,

$$ \Delta W = \frac{1}{n} W^*(X^n \| Y^n) - \frac{1}{n} W^*(X^n). \qquad (5) $$

Thus $\Delta W$ characterizes the value of side information $Y^n$.
Theorem 1 leads to the following corollary, which gives a
new operational meaning of Massey's directed information.

Corollary 1: The increase in growth rate due to causal side
information $Y^n$ for horse races $X^n$ is

$$ \Delta W = \frac{1}{n} I(Y^n \to X^n). \qquad (6) $$



Proof: From Theorem 1, we have

$$ W^*(X^n \| Y^n) - W^*(X^n) = -H(X^n \| Y^n) + H(X^n) = I(Y^n \to X^n). $$
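
Theorem 1 and Corollary 1 can be checked by direct enumeration of the wealth. The sketch below is an illustration under assumptions that are not part of the result itself: it uses the binary Markov race with noisy side information described in Section V, a horizon of n = 3, parameters p = 0.2 and q = 0.25, and fair uniform odds of 2. For that model the finite-horizon directed information works out to (1 - h(q)) + (n - 1)(h(p*q) - h(q)), which the code uses as a reference value.

```python
import itertools
import math

def binary_entropy(t):
    """h(t) in bits."""
    if t in (0.0, 1.0):
        return 0.0
    return -t * math.log2(t) - (1 - t) * math.log2(1 - t)

def transition(prev, x, p):
    """P(X_i = x | X_{i-1} = prev); prev = None means a uniform first race."""
    if prev is None:
        return 0.5
    return 1 - p if x == prev else p

def observation(y, x, q):
    """P(Y_i = y | X_i = x): the tip is wrong with probability q."""
    return 1 - q if y == x else q

def expected_log_wealth(n, p, q, use_side_info, odds=2.0):
    """E[log2 S] when the whole wealth is bet proportionally in every race."""
    total = 0.0
    for xs in itertools.product((0, 1), repeat=n):
        for ys in itertools.product((0, 1), repeat=n):
            prob, log_wealth, prev = 1.0, 0.0, None
            for x, y in zip(xs, ys):
                prior = transition(prev, x, p)
                if use_side_info:
                    # b(x_i | x^{i-1}, y^i) = p(x_i | x_{i-1}, y_i) for this model.
                    norm = sum(transition(prev, c, p) * observation(y, c, q)
                               for c in (0, 1))
                    bet = prior * observation(y, x, q) / norm
                else:
                    bet = prior
                log_wealth += math.log2(bet * odds)
                prob *= prior * observation(y, x, q)
                prev = x
            total += prob * log_wealth
    return total

n, p, q = 3, 0.2, 0.25
growth_with = expected_log_wealth(n, p, q, use_side_info=True)
growth_without = expected_log_wealth(n, p, q, use_side_info=False)
gain = growth_with - growth_without

pq = (1 - p) * q + (1 - q) * p  # the convolution p * q
reference = (1 - binary_entropy(q)) + (n - 1) * (binary_entropy(pq) - binary_entropy(q))
```

Here `gain` is the total growth difference over n races; dividing by n gives the increase in growth rate of Corollary 1.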



B. Investing only part of the money 

In this subsection we consider the case where the gambler
does not necessarily invest all his money in the gambling. Let
$b_0(y^i, x^{i-1})$ be the portion of money that the gambler does
not invest in gambling at time $i$ given that the previous race
results were $x^{i-1}$ and the side information is $y^i$. In this setting,
the wealth is given by

$$ S(X^n \| Y^n) = \prod_{i=1}^{n} \left( b_0(X^{i-1}, Y^i) + b(X_i | X^{i-1}, Y^i)\, o(X_i | X^{i-1}) \right), $$

and the growth $W(X^n \| Y^n)$ is defined as before in (2).

The term $W(X^n \| Y^n)$ obeys a chain rule similar to that of the
causally conditional entropy $H(X^n \| Y^n)$, i.e.,

$$ W(X^n \| Y^n) = \sum_{i=1}^{n} W(X_i | X^{i-1}, Y^i), $$

where

$$ W(X_i | X^{i-1}, Y^i) \triangleq E\left[ \log\left( b_0(X^{i-1}, Y^i) + b(X_i | X^{i-1}, Y^i)\, o(X_i | X^{i-1}) \right) \right]. $$

Note that for any given history $(x^{i-1}, y^i) \in \mathcal{X}^{i-1} \times \mathcal{Y}^i$, the
betting scheme $\{b_0(x^{i-1}, y^i), b(x_i | x^{i-1}, y^i)\}$ influences only
$W(X_i | x^{i-1}, y^i)$, so that we have

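$$
\begin{aligned}
\max_{\{b_0(x^{i-1}, y^i),\, b(x_i | x^{i-1}, y^i)\}_{i=1}^{n}} W(X^n \| Y^n)
&= \sum_{i=1}^{n} \max_{b_0(x^{i-1}, y^i),\, b(x_i | x^{i-1}, y^i)} W(X_i | X^{i-1}, Y^i) \\
&= \sum_{i=1}^{n} \sum_{x^{i-1}, y^i} p(x^{i-1}, y^i) \max_{b_0(x^{i-1}, y^i),\, b(x_i | x^{i-1}, y^i)} W(X_i | x^{i-1}, y^i).
\end{aligned}
$$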



The optimization problem in the last equation is equivalent
to the problem of finding the optimal betting strategy in
the memoryless case where the winning horse distribution
$p(x)$ is $p(x) = \Pr(X_i = x | x^{i-1}, y^i)$, the odds $o(x)$ are
$o(x) = o(X_i = x | x^{i-1})$, and the betting strategy $(b_0, b(x))$
is $(b_0(x^{i-1}, y^i), b(X_i = x | x^{i-1}, y^i))$, respectively. Hence,
the optimization $\max W(X_i | x^{i-1}, y^i)$ is equivalent to the
following convex problem:

$$
\begin{aligned}
\text{maximize} \quad & \sum_{x} p(x) \log\left( b_0 + b(x)\, o(x) \right) \\
\text{subject to} \quad & b_0 + \sum_{x} b(x) = 1, \\
& b_0 \ge 0, \quad b(x) \ge 0, \quad \forall x \in \mathcal{X}.
\end{aligned}
$$
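
To get a feel for this problem, one can evaluate the objective on a grid for two horses. The sketch below is a brute-force illustration only, not the algorithm of [4]; the probabilities, the sub-fair odds, the grid resolution, and the function names are hypothetical choices.

```python
import math

def growth(p, odds, b0, bets):
    """E[log2(b0 + b(X) o(X))]; -inf if an outcome with p > 0 leaves zero wealth."""
    total = 0.0
    for px, ox, bx in zip(p, odds, bets):
        if px == 0:
            continue
        wealth = b0 + bx * ox
        if wealth <= 0:
            return -math.inf
        total += px * math.log2(wealth)
    return total

def grid_search_two_horses(p, odds, steps=100):
    """Exhaustive search over (b0, b1, b2) on a simplex grid with spacing 1/steps."""
    best_value, best_point = -math.inf, None
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            bets = (i / steps, j / steps)
            b0 = (steps - i - j) / steps
            value = growth(p, odds, b0, bets)
            if value > best_value:
                best_value, best_point = value, (b0, bets)
    return best_value, best_point

# Sub-fair odds: 1/1.5 + 1/1.5 > 1. Horse 0 is a favorable bet (0.8 * 1.5 > 1).
value, (b0, bets) = grid_search_two_horses(p=(0.8, 0.2), odds=(1.5, 1.5))

# With a uniform winner neither horse is favorable, so the search keeps all cash.
value_flat, (b0_flat, bets_flat) = grid_search_two_horses(p=(0.5, 0.5), odds=(1.5, 1.5))
```

In the first case the stationarity condition along b1 (with b2 = 0) gives b1 = 0.4 and b0 = 0.6, which lies exactly on the grid, so the search recovers a solution with part of the wealth held back as cash.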

The solution to this optimization problem was given by
Kelly [4]. If the odds are super-fair, namely, $\sum_x \frac{1}{o(x)} \le 1$,
then the gambler will invest all his wealth in the race rather
than leave some as cash, since by betting $b(x) = \frac{c}{o(x)}$, where
$c = 1 / \sum_x \frac{1}{o(x)}$, the gambler's money will be multiplied by
$c \ge 1$, regardless of the race outcome. Therefore, for this case,
the solution is given by Theorem 1, where the gambler invests
in proportion to the causal conditioning distribution $p(x^n \| y^n)$.
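
The super-fair argument is easy to verify numerically. A minimal sketch with hypothetical odds (the function name is illustrative):

```python
def riskless_bets(odds):
    """Bets b(x) = c / o(x) with c = 1 / sum_x 1/o(x); they sum to one."""
    c = 1.0 / sum(1.0 / o for o in odds)
    return c, [c / o for o in odds]

# Hypothetical super-fair odds: 1/3 + 1/4 + 1/4 < 1.
odds = [3.0, 4.0, 4.0]
c, bets = riskless_bets(odds)
payoffs = [b * o for b, o in zip(bets, odds)]  # wealth multiplier for each winner
```

Every entry of `payoffs` equals `c`, so holding cash (multiplier 1) is never better than this fully invested bet when `c` is at least 1.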



If the odds are sub-fair, i.e., $\sum_x \frac{1}{o(x)} > 1$, then it is optimal
to bet only some of the money, namely $b_0 > 0$. The solution
to this problem is given in terms of an algorithm in [4, p. 925].

V. An example 

Here we consider betting in a horse race, where the winning
horse can be represented as a Markov process, and causal side
information is available.

Example 1: Consider the horse race process depicted in
Figure 1, where two horses are racing and the winning horse
$X_i$ behaves as a Markov process. A horse that won will win
again with probability $1 - p$ and lose with probability $p$. At
time zero, we assume that both horses have probability $\frac{1}{2}$ of
winning. The side information $Y_i$ at time $i$ is a noisy observation
of the horse race outcome $X_i$. It has probability $1 - q$ of being
equal to $X_i$, and probability $q$ of being different from $X_i$.

For this example, the increase in growth rate due to side
information as $n$ goes to infinity is

$$ \Delta W = h(p * q) - h(q), $$

where the function $h(\cdot)$ denotes the binary entropy, i.e.,
$h(x) = -x \log x - (1 - x) \log(1 - x)$, and $p * q$ denotes
the parameter of a Bernoulli distribution that results from
convolving two Bernoulli distributions with parameters $p$ and
$q$, i.e., $p * q = (1 - p)q + (1 - q)p$.
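
The closed form is simple to evaluate. The following sketch uses illustrative function names and base-2 logarithms, consistent with the growth being measured in bits:

```python
import math

def binary_entropy(t):
    """h(t) = -t log2 t - (1 - t) log2(1 - t), with h(0) = h(1) = 0."""
    if t in (0.0, 1.0):
        return 0.0
    return -t * math.log2(t) - (1 - t) * math.log2(1 - t)

def convolve(p, q):
    """Binary convolution p * q = (1 - p) q + (1 - q) p."""
    return (1 - p) * q + (1 - q) * p

def delta_w(p, q):
    """Asymptotic increase in growth rate, h(p * q) - h(q), for Example 1."""
    return binary_entropy(convolve(p, q)) - binary_entropy(q)

gain = delta_w(0.2, 0.25)          # the parameters used for the plots
useless = delta_w(0.2, 0.5)        # side information independent of the winner
perfect = delta_w(0.2, 0.0)        # side information reveals the winner
```

Two sanity checks follow directly from the formula: with q = 1/2 the observation is pure noise and the gain vanishes, and with q = 0 the gain equals h(p), the entropy rate of the race.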

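The increase in the growth rate $\Delta W$ for this example can
be obtained using first principles as follows:

$$
\begin{aligned}
\Delta W
&= \lim_{n \to \infty} \frac{1}{n} I(Y^n \to X^n) \\
&= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left[ H(Y^i | X^{i-1}) - H(Y^i | X^i) \right] \\
&= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left[ H(Y^i | X^{i-1}) - H(Y^{i-1} | X^i) - H(Y_i | X_i) \right] \\
&\overset{(a)}{=} \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left[ H(Y^i | X^{i-1}) - H(Y^{i-1} | X^{i-1}) - H(Y_1 | X_1) \right] \\
&= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left[ H(Y_i | Y^{i-1}, X^{i-1}) - H(Y_1 | X_1) \right] \\
&\overset{(b)}{=} H(Y_1 | X_0) - H(Y_1 | X_1) = h(p * q) - h(q), \qquad (7)
\end{aligned}
$$

where steps (a) and (b) are due to the Markov structure and the
stationarity of the process $(X_i, Y_i)$. Alternatively, the sequence of
equalities up to step (b) in (7) can be derived directly using

$$
\begin{aligned}
\frac{1}{n} I(Y^n \to X^n)
&\overset{(a)}{=} \frac{1}{n} \sum_{i=1}^{n} I(Y_i; X_i^n | X^{i-1}, Y^{i-1}) \\
&\overset{(b)}{=} H(Y_1 | X_0) - H(Y_1 | X_1), \qquad (8)
\end{aligned}
$$

where (a) is the identity given in [11, eq. (9)] and (b) is due
to the stationarity of the process.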



[Figure: two-state Markov chain with states "Horse 1 wins" and "Horse 2 wins" and transition probability p between them.]

Fig. 1. The setting of Example 1. The winning horse $X_i$ is represented as
a Markov process with two states. In state 1, horse number 1 wins, and in
state 2, horse number 2 wins. The side information, $Y_i$, is a noisy observation
of the winning horse, $X_i$.

If the side information is known with some lookahead
$k \in \{0, 1, \ldots\}$, that is, if the gambler knows $Y^{i+k}$ at time $i$, then
the increase in growth rate is given by

$$
\begin{aligned}
\Delta W &= \lim_{n \to \infty} \frac{1}{n} I(Y^{n+k} \to X^n) \\
&= H(Y_{k+1} | Y^k, X_0) - H(Y_1 | X_1), \qquad (9)
\end{aligned}
$$

where the last equality is due to the same arguments as in (8).
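
The lookahead expression can be evaluated numerically for small k. The sketch below is a brute-force illustration: it conditions on X_0 = 0 (by the symmetry of the chain this gives the same conditional entropy as averaging over X_0), runs the forward recursion of the hidden Markov model for every observation sequence, and uses H(Y_{k+1} | Y^k, X_0) = H(Y^{k+1} | X_0) - H(Y^k | X_0). The function names and the parameters p = 0.2, q = 0.25 are illustrative choices.

```python
import itertools
import math

def binary_entropy(t):
    if t in (0.0, 1.0):
        return 0.0
    return -t * math.log2(t) - (1 - t) * math.log2(1 - t)

def sequence_prob_given_x0(ys, p, q):
    """P(Y_1, ..., Y_t = ys | X_0 = 0) via the forward recursion."""
    alpha = {0: 1.0, 1: 0.0}  # alpha[x] = P(y^t, X_t = x | X_0 = 0)
    for y in ys:
        new_alpha = {}
        for x in (0, 1):
            predicted = sum(alpha[s] * ((1 - p) if s == x else p) for s in (0, 1))
            new_alpha[x] = predicted * ((1 - q) if y == x else q)
        alpha = new_alpha
    return alpha[0] + alpha[1]

def block_entropy_given_x0(t, p, q):
    """H(Y^t | X_0 = 0) in bits, by enumerating all 2^t observation sequences."""
    total = 0.0
    for ys in itertools.product((0, 1), repeat=t):
        prob = sequence_prob_given_x0(ys, p, q)
        if prob > 0:
            total -= prob * math.log2(prob)
    return total

def delta_w_lookahead(k, p, q):
    """H(Y_{k+1} | Y^k, X_0) - H(Y_1 | X_1) for the model of Example 1."""
    conditional = (block_entropy_given_x0(k + 1, p, q)
                   - block_entropy_given_x0(k, p, q))
    return conditional - binary_entropy(q)

p, q = 0.2, 0.25
gains = [delta_w_lookahead(k, p, q) for k in range(6)]
```

For k = 0 this reduces to h(p * q) - h(q), and the values do not decrease with k, in line with the intuition that more lookahead cannot hurt the gambler.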

Figure 2 shows the increase in growth rate $\Delta W$ due to side
information as a function of the side information parameters
$(q, k)$. The left plot shows $\Delta W$ as a function of $q$, where
$p = 0.2$ and there is no lookahead, $k = 0$. The right plot shows $\Delta W$
as a function of $k$, where $p = 0.2$ and $q = 0.25$. If the entire
side information sequence $Y_1, Y_2, \ldots$ is known to the gambler
ahead of time, then we should have mutual information rather
than directed information, i.e.,

$$
\begin{aligned}
\Delta W &= \lim_{n \to \infty} \frac{1}{n} I(Y^n; X^n) \\
&= \lim_{n \to \infty} \frac{H(Y^n)}{n} - H(Y_1 | X_1), \qquad (10)
\end{aligned}
$$

and this coincides with the fact that for a stationary hidden
Markov process $\{Y_1, Y_2, \ldots\}$ the sequence $H(Y_{k+1} | Y^k, X_0)$
converges to the entropy rate of the process.

VI. Conclusion and further extensions 

We have shown that directed information arises naturally 
in gambling as the gain in the maximum achievable capital 
growth due to the availability of causal side information. We 
now outline two extensions: stock market portfolio strategies 
and data compression in the presence of causal side informa- 
tion. Details are given in [16]. 

A. Stock market 

Using notation similar to that in [15, ch. 16], a stock
market at time $i$ is represented as a vector of stocks
$\mathbf{X}_i = (X_{i1}, X_{i2}, \ldots, X_{im})$, where $m$ is the number of stocks, and the
price relative $X_{ik}$ is the ratio of the price of stock $k$ at the end
of day $i$ to the price of stock $k$ at the beginning of day $i$. We
assume that at time $i$ there is side information $Y^i$ that is known
to the investor. A portfolio is an allocation of wealth across
the stocks. A nonanticipating or causal portfolio strategy with
causal side information at time $i$ is denoted as $\mathbf{b}(\mathbf{x}^{i-1}, y^i)$, and
it satisfies $\sum_{k=1}^{m} b_k(\mathbf{x}^{i-1}, y^i) = 1$ and $b_k(\mathbf{x}^{i-1}, y^i) \ge 0$ for
all possible $\mathbf{x}^{i-1}, y^i$. We define $S(\mathbf{x}^n \| y^n)$ as the wealth at
the end of day $n$ for a stock sequence $\mathbf{x}^n$ and causal side
information $y^n$. We can write

$$ S(\mathbf{x}^n \| y^n) = \left( \mathbf{b}^t(\mathbf{x}^{n-1}, y^n)\, \mathbf{x}_n \right) S(\mathbf{x}^{n-1} \| y^{n-1}), $$

where $(\cdot)^t$ denotes the transpose of a vector. The goal is to
maximize the growth $W(\mathbf{X}^n \| Y^n) = E[\log S(\mathbf{X}^n \| Y^n)]$. We
also define $W(\mathbf{X}_n | \mathbf{X}^{n-1}, Y^n) = E[\log(\mathbf{b}^t(\mathbf{X}^{n-1}, Y^n)\, \mathbf{X}_n)]$.
From this definition, we can write the chain rule

$$ W(\mathbf{X}^n \| Y^n) = \sum_{i=1}^{n} W(\mathbf{X}_i | \mathbf{X}^{i-1}, Y^i). $$

[Figure: two panels plotting $\Delta W$ against $q$ (left) and against $k$ (right).]

Fig. 2. Increase in the growth rate in Example 1 as a function of the side information parameters $(q, k)$. The left plot shows the increase of
the growth rate $\Delta W$ as a function of $q = \Pr(X_i \ne Y_i)$ with no lookahead. The right plot shows the increase of the growth rate as a function of the lookahead $k$,
where $q = 0.25$. The horse race outcome is assumed to be a first-order binary symmetric Markov process with parameter $p = 0.2$.

The gambling in horse races with $m$ horses studied in the
previous section is a special case of investing in the stock market
with $m + 1$ stocks. The first $m$ stocks correspond to the $m$
horses, and at the end of the day one of the stocks, say
$k \in \{1, \ldots, m\}$, gets the value $o(k)$ with probability $p(k)$ and all
other stocks become zero. The $(m + 1)$-st stock is always one,
and it allows the gambler to invest only part of the wealth in
the horse race.
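
This embedding can be written out in a few lines. The sketch below uses hypothetical odds and a hypothetical portfolio; the function names are illustrative. The last stock plays the role of cash.

```python
def horse_race_price_relatives(winner, odds):
    """Price-relative vector of the m + 1 stocks when horse `winner` wins."""
    horses = [odds[k] if k == winner else 0.0 for k in range(len(odds))]
    return horses + [1.0]  # the (m + 1)-st stock is always one (cash)

def wealth_update(wealth, portfolio, price_relatives):
    """S_n = (b^t x_n) S_{n-1} for one trading day."""
    return wealth * sum(b * x for b, x in zip(portfolio, price_relatives))

odds = [1.5, 1.5]
portfolio = [0.4, 0.0, 0.6]  # 40% on horse 0, nothing on horse 1, 60% held back
wealth_if_0_wins = wealth_update(1.0, portfolio, horse_race_price_relatives(0, odds))
wealth_if_1_wins = wealth_update(1.0, portfolio, horse_race_price_relatives(1, odds))
```

The update reproduces the horse-race wealth factor b_0 + b(x) o(x): 0.6 + 0.4 · 1.5 when horse 0 wins, and 0.6 when horse 1 wins.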

The developments in the previous section can be expanded 
to characterize the increase in growth rate due to side infor- 
mation, where again directed information emerges as the key 
quantity, upper-bounding the value of causal side information; 
cf. [17]. Details will be given in [16]. 

B. Instantaneous compression with causal side information 

Let $X_1, X_2, \ldots$ be a source and $Y_1, Y_2, \ldots$ its side
information sequence. The source is to be losslessly encoded
instantaneously, with causally available side information. More
precisely, an instantaneous lossless source encoder with causal
side information consists of a sequence of mappings $\{M_i\}_{i \ge 1}$
such that each $M_i : \mathcal{X}^i \times \mathcal{Y}^i \to \{0, 1\}^*$ has the property
that for every $x^{i-1}$ and $y^i$, $M_i(x^{i-1}\,\cdot\,, y^i)$ is an instantaneous
(prefix) code for $X_i$.

An instantaneous lossless source encoder with causal side
information operates sequentially, emitting the concatenated
bit stream $M_1(X_1, Y_1) M_2(X^2, Y^2) \cdots$. The defining property
that $M_i(x^{i-1}\,\cdot\,, y^i)$ is an instantaneous code for every $x^{i-1}$ and
$y^i$ is a necessary and sufficient condition for the existence of
a decoder that can losslessly recover $x^i$ based on $y^i$ and the
bit stream $M_1(x_1, y_1) M_2(x^2, y^2) \cdots$ just as soon as it sees
$M_1(x_1, y_1) M_2(x^2, y^2) \cdots M_i(x^i, y^i)$, for all sequence pairs
$(x_1, y_1), (x_2, y_2), \ldots$ and all $i \ge 1$. Using natural extensions
of standard arguments, we show in [16] that $I(Y^n \to X^n)$ is
essentially (up to terms that are sublinear in $n$) the rate savings
in optimal sequential lossless compression of $X^n$ due to the
causal availability of the side information.
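
As a small illustration of the prefix-code requirement for a single history $(x^{i-1}, y^i)$, one may take Shannon code lengths matched to the conditional pmf of $X_i$. This is only a sketch of the per-history property, not the construction or the rate analysis of [16]; the pmfs and function names below are hypothetical.

```python
import math

def shannon_code_lengths(pmf):
    """Codeword lengths ceil(-log2 p) for the symbols with p > 0."""
    return [math.ceil(-math.log2(p)) for p in pmf if p > 0]

def kraft_sum(lengths):
    """Sum of 2^(-length); a prefix code with these lengths exists iff it is <= 1."""
    return sum(2.0 ** -length for length in lengths)

# Hypothetical conditional pmfs of X_i for two different histories.
dyadic_pmf = [0.5, 0.25, 0.125, 0.125]
skewed_pmf = [0.7, 0.2, 0.1]

dyadic_lengths = shannon_code_lengths(dyadic_pmf)
skewed_lengths = shannon_code_lengths(skewed_pmf)
```

Because the Kraft sum never exceeds one for these lengths, a prefix code exists for every history, which is the property required of each mapping in the encoder sequence.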